Dataset Description

The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which extracts a combination of scale independent features utilising both classical moments based measures such as scaled variance, skewness and kurtosis about the major/minor axes and heuristic measures such as hollows, circularity, rectangularity and compactness.

Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

Problem statement

The objective is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Data Dictionary

  • 1- compactness

  • 2- circularity

  • 3- distance_circularity

  • 4- radius_ratio

  • 5- pr.axis_aspect_ratio

  • 6- max.length_aspect_ratio

  • 7- scatter_ratio

  • 8- elongatedness

  • 9- pr.axis_rectangularity

  • 10- max.length_rectangularity

  • 11- scaled_variance

  • 12- scaled_variance.1

  • 13- scaled_radius_of_gyration

  • 14- scaled_radius_of_gyration.1

  • 15- skewness_about

  • 16- skewness_about.1

  • 17- skewness_about.2

  • 18- hollows_ratio

  • 19- class

1. Import the Libraries

In [1]:
#For numerical libraries
import numpy as np
#To handle data in the form of rows and columns
import pandas as pd
#importing seaborn for statistical plots
import seaborn as sns
#importing ploting libraries
import matplotlib.pyplot as plt
#styling figures
plt.rc('font',size=14)
sns.set(style='white')
sns.set(style='whitegrid',color_codes=True)
#To enable plotting graphs in Jupyter notebook
%matplotlib inline
#importing the Encoding library
from sklearn.preprocessing import LabelEncoder
#Build the model with the best hyper parameters
from sklearn.model_selection import cross_val_score
#importing the zscore for scaling
from scipy.stats import zscore
#Importing PCA for dimensionality reduction and visualization
from sklearn.decomposition import PCA
# Import Support Vector Classifier machine learning library
from sklearn.svm import SVC
#Import Sklearn package's data splitting function which is based on random function
from sklearn.model_selection import train_test_split
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
# Import the metrics
from sklearn import metrics

2. Load the dataset

In [2]:
#reading the CSV file into pandas dataframe
vehicle_df=pd.read_csv('vehicle.csv')
In [3]:
#Check top 5 records of the dataset
vehicle_df.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
  • It shows that there are 18 independent variables (compactness, circularity, distance_circularity, radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scatter_ratio, elongatedness, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration, scaled_radius_of_gyration.1, skewness_about, skewness_about.1, skewness_about.2, hollows_ratio) and one dependent variable (class).
  • Only the class variable is non numeric others are numeric.
In [4]:
#Check the last 5 records of the dataset
vehicle_df.tail()
Out[4]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
841 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car
842 89 46.0 84.0 163.0 66.0 11 159.0 43.0 20.0 159 173.0 368.0 176.0 72.0 1.0 20.0 186.0 197 van
843 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car
844 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car
845 85 36.0 66.0 123.0 55.0 5 120.0 56.0 17.0 128 140.0 212.0 131.0 73.0 1.0 18.0 186.0 190 van

3. Data Preprocessing

Understanding the data

Data types and data description

In [5]:
#To show the detailed summary 
vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
  • It gives the information about the number of rows, number of columns, data types , memory usage, number of null values in each columns.
In [6]:
#Analyze the distribution of the dataset
vehicle_df.describe().T
Out[6]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
  • It gives the descriptive statistics (mean, median, mode, percentiles, min, max, standard deviation) of the columns of the dataset.
  • By analysing it, we can see that

    -compactness, circularity, distance_circularity, elongatedness, pr.axis_rectangularity, max.length_rectangularity, scaled_radius_of_gyration, scaled_radius_of_gyration.1, skewness_about.2, hollows_ratio are approximately normally distributed.

    -radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scatter_ratio, scaled_variance, scaled_variance.1, skewness_about, skewness_about.1 are approx. right skewed distribution.

In [7]:
#It shows data types of columns
vehicle_df.dtypes
Out[7]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [8]:
#class attribute is not an object it is a category
vehicle_df['class']=vehicle_df['class'].astype('category')
In [9]:
#To get the shape 
vehicle_df.shape
Out[9]:
(846, 19)
  • It shows the shape of the dataset i.e. there are 846 rows and 19 columns.
In [10]:
#To get the number of columns
vehicle_df.columns
Out[10]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')

Checking for Missing Values

In [11]:
#Checking for missing values in the dataset
vehicle_df.isnull().sum()
Out[11]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [12]:
#replace missing variable('?') into null variable using numpy
vehicle_df = vehicle_df.replace(' ', np.nan)

Handling the Missing values

In [13]:
#Replacing the missing values by median 
for i in vehicle_df.columns[:17]:
    median_value = vehicle_df[i].median()
    vehicle_df[i] = vehicle_df[i].fillna(median_value)
  • As some data are rightly skewed so we will use median to handle the missing values.
In [14]:
# again check for missing values
vehicle_df.isnull().sum()
Out[14]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [15]:
# Again check data information
vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   compactness                  846 non-null    int64   
 1   circularity                  846 non-null    float64 
 2   distance_circularity         846 non-null    float64 
 3   radius_ratio                 846 non-null    float64 
 4   pr.axis_aspect_ratio         846 non-null    float64 
 5   max.length_aspect_ratio      846 non-null    int64   
 6   scatter_ratio                846 non-null    float64 
 7   elongatedness                846 non-null    float64 
 8   pr.axis_rectangularity       846 non-null    float64 
 9   max.length_rectangularity    846 non-null    int64   
 10  scaled_variance              846 non-null    float64 
 11  scaled_variance.1            846 non-null    float64 
 12  scaled_radius_of_gyration    846 non-null    float64 
 13  scaled_radius_of_gyration.1  846 non-null    float64 
 14  skewness_about               846 non-null    float64 
 15  skewness_about.1             846 non-null    float64 
 16  skewness_about.2             846 non-null    float64 
 17  hollows_ratio                846 non-null    int64   
 18  class                        846 non-null    category
dtypes: category(1), float64(14), int64(4)
memory usage: 120.0 KB
  • Now, as it shows there are no misisng values present in the features.

Understanding the outier using boxplot

In [16]:
# Understand the spread and outliers in dataset using boxplot
vehicle_df.boxplot(figsize=(35,15))
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x207689e8d60>

It is showing that there are some columns which contains outliers such as radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, skewness_about, skewness_about.1.

In [17]:
# Histogram 
vehicle_df.hist(figsize=(15,15))
Out[17]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002076899B400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002076966A3D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768B97D60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768B97D90>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000020768BBF7F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768C07D00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768C307C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768C59280>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000020768C78D30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768C9F820>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768CC8310>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768CE5DF0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000020768D0E940>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768D39430>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768D57EB0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768D81970>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000020768DA9460>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768DC8EE0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768DF0A00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000020768DF0B50>]],
      dtype=object)
  • It is also the distribution of the attributes.

Handling the outlier

In [18]:
#find the outliers and replace them by median
for col_name in vehicle_df.columns[:-1]:
    q1 = vehicle_df[col_name].quantile(0.25)
    q3 = vehicle_df[col_name].quantile(0.75)
    iqr = q3 - q1
    
    low = q1-1.5*iqr
    high = q3+1.5*iqr
    
    vehicle_df.loc[(vehicle_df[col_name] < low) | (vehicle_df[col_name] > high), col_name] = vehicle_df[col_name].median()
In [19]:
# again check for outliers in dataset using boxplot
vehicle_df.boxplot(figsize=(35,15))
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x20768e51070>
  • It shows that after handing outliers there are no outliers present in the dataset.

4. Understanding the attributes

Dependent Attribute

class

In [20]:
print('Class: \n', vehicle_df['class'].unique())
Class: 
 [van, car, bus]
Categories (3, object): [van, car, bus]
In [21]:
vehicle_df['class'].value_counts()
Out[21]:
car    429
bus    218
van    199
Name: class, dtype: int64

Univariate Analysis

In [22]:
sns.countplot(vehicle_df['class'])
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x2076914dca0>

Encode the dependent attribute

In [23]:
#Encoding of categorical variables
labelencoder_X=LabelEncoder()
vehicle_df['class']=labelencoder_X.fit_transform(vehicle_df['class'])

Independent Attributes

Multivariate Analysis

In [24]:
#correlation matrix
cor=vehicle_df.corr()
cor
Out[24]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
compactness 1.000000 0.684887 0.789928 0.721925 0.192864 0.499928 0.812620 -0.788750 0.813694 0.676143 0.769871 0.806170 0.585243 -0.246681 0.197308 0.156348 0.298537 0.365552 -0.033796
circularity 0.684887 1.000000 0.792320 0.638280 0.203253 0.560470 0.847938 -0.821472 0.843400 0.961318 0.802768 0.827462 0.925816 0.068745 0.136351 -0.009666 -0.104426 0.046351 -0.158910
distance_circularity 0.789928 0.792320 1.000000 0.794222 0.244332 0.666809 0.905076 -0.911307 0.893025 0.774527 0.869584 0.883943 0.705771 -0.229353 0.099107 0.262345 0.146098 0.332732 -0.064467
radius_ratio 0.721925 0.638280 0.794222 1.000000 0.650554 0.463958 0.769941 -0.825392 0.744139 0.579468 0.786183 0.760257 0.550774 -0.390459 0.035755 0.179601 0.405849 0.491758 -0.213948
pr.axis_aspect_ratio 0.192864 0.203253 0.244332 0.650554 1.000000 0.150295 0.194195 -0.298144 0.163047 0.147592 0.207101 0.196401 0.148591 -0.321070 -0.056030 -0.021088 0.400882 0.415734 -0.209298
max.length_aspect_ratio 0.499928 0.560470 0.666809 0.463958 0.150295 1.000000 0.490759 -0.504181 0.487931 0.642713 0.401391 0.463249 0.397397 -0.335444 0.081898 0.141664 0.083794 0.413174 0.352958
scatter_ratio 0.812620 0.847938 0.905076 0.769941 0.194195 0.490759 1.000000 -0.971601 0.989751 0.809083 0.960883 0.980447 0.799875 0.011314 0.064242 0.211647 0.005628 0.118817 -0.288895
elongatedness -0.788750 -0.821472 -0.911307 -0.825392 -0.298144 -0.504181 -0.971601 1.000000 -0.948996 -0.775854 -0.947644 -0.948851 -0.766314 0.078391 -0.046943 -0.183642 -0.115126 -0.216905 0.339344
pr.axis_rectangularity 0.813694 0.843400 0.893025 0.744139 0.163047 0.487931 0.989751 -0.948996 1.000000 0.810934 0.947329 0.973606 0.796690 0.027545 0.073127 0.213801 -0.018649 0.099286 -0.258481
max.length_rectangularity 0.676143 0.961318 0.774527 0.579468 0.147592 0.642713 0.809083 -0.775854 0.810934 1.000000 0.750222 0.789632 0.866450 0.053856 0.130702 0.004129 -0.103948 0.076770 -0.032399
scaled_variance 0.769871 0.802768 0.869584 0.786183 0.207101 0.401391 0.960883 -0.947644 0.947329 0.750222 1.000000 0.943780 0.785073 0.025828 0.024693 0.197122 0.015171 0.086330 -0.324062
scaled_variance.1 0.806170 0.827462 0.883943 0.760257 0.196401 0.463249 0.980447 -0.948851 0.973606 0.789632 0.943780 1.000000 0.782972 0.009386 0.065731 0.204941 0.017557 0.119642 -0.279487
scaled_radius_of_gyration 0.585243 0.925816 0.705771 0.550774 0.148591 0.397397 0.799875 -0.766314 0.796690 0.866450 0.785073 0.782972 1.000000 0.215279 0.162970 -0.055667 -0.224450 -0.118002 -0.250267
scaled_radius_of_gyration.1 -0.246681 0.068745 -0.229353 -0.390459 -0.321070 -0.335444 0.011314 0.078391 0.027545 0.053856 0.025828 0.009386 0.215279 1.000000 -0.057755 -0.123996 -0.832738 -0.901332 -0.283540
skewness_about 0.197308 0.136351 0.099107 0.035755 -0.056030 0.081898 0.064242 -0.046943 0.073127 0.130702 0.024693 0.065731 0.162970 -0.057755 1.000000 -0.041734 0.086661 0.062619 0.126720
skewness_about.1 0.156348 -0.009666 0.262345 0.179601 -0.021088 0.141664 0.211647 -0.183642 0.213801 0.004129 0.197122 0.204941 -0.055667 -0.123996 -0.041734 1.000000 0.074473 0.200651 -0.010872
skewness_about.2 0.298537 -0.104426 0.146098 0.405849 0.400882 0.083794 0.005628 -0.115126 -0.018649 -0.103948 0.015171 0.017557 -0.224450 -0.832738 0.086661 0.074473 1.000000 0.892581 0.067244
hollows_ratio 0.365552 0.046351 0.332732 0.491758 0.415734 0.413174 0.118817 -0.216905 0.099286 0.076770 0.086330 0.119642 -0.118002 -0.901332 0.062619 0.200651 0.892581 1.000000 0.235874
class -0.033796 -0.158910 -0.064467 -0.213948 -0.209298 0.352958 -0.288895 0.339344 -0.258481 -0.032399 -0.324062 -0.279487 -0.250267 -0.283540 0.126720 -0.010872 0.067244 0.235874 1.000000
In [25]:
# correlation plot---heatmap
sns.set(font_scale=1.15)
fig,ax=plt.subplots(figsize=(18,15))
sns.heatmap(cor,vmin=0.8, annot=True,linewidths=0.01,center=0,linecolor="white",cbar=False,square=True)
plt.title('Correlation between attributes',fontsize=18)
ax.tick_params(labelsize=18)
  • It shows that there are some attributes which are highly correlated as there corelation value is very high.
  • For example: compactness is highly correlated in a positive way with scatter_ratio, pr.axis_rectangularity, scaled_variance1, distance_circularity, scaled_variance, radius_ratio. compactness is highly correlated in a negative way with elongatedness.
In [26]:
#pair panel
sns.pairplot(vehicle_df,hue='class')
Out[26]:
<seaborn.axisgrid.PairGrid at 0x2076915d160>
  • It is also showing the same information as correlation matrix.
  • compactness has positive linear relationship with circularity, distance_circularity, radius_ratio, scatter_ratio, pr.axis_rectangularity, max.axis_rectangularity, scaled_variance1, scaled_variance. compactness has negative linear relationship with elongatedness.
  • circularity has positive linear relationship with distance_circularity, scatter_ratio, pr.axis_rectangularity, max.axis_rectangularity, scaled_variance, scaled_variance1, scaled_radius_of_gyration. circularity has negative linear relationship with elongatedness.
  • distance_circularity has positive linear relationship with radius_ratio, scatter_ratio, pr.axis_rectangularity, max.axis_rectangularity, scaled_variance, scaled_variance1, scaled_radius_of_gyration. distance_circularity has negative linear relationship with elongatedness.
  • radius_ratio has positive linear relationship with pr.axis_aspect_ratio, scatter_ratio, scaled_variance, scaled_variance1, scaled_radius_of_gyration. radius_ratio has negative linear relationship with elongatedness.

5. Without applying Dimensionality Reduction

Splitting the data into independent and dependent attributes

In [27]:
#independent and dependent variables
X=vehicle_df.iloc[:,0:18]
y = vehicle_df.iloc[:,18]

Splitting the data

In [28]:
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 10)

Model building

Support Vector Classifier

In [29]:
model = SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
In [30]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.6621621621621622
Accuracy on Testing data:  0.6496062992125984
Recall value:  0.6198542982030112
Precision value:  0.6166822284469343
Confusion Matrix:
 [[34  4 22]
 [31 95  0]
 [ 6 26 36]]
Classification Report:
               precision    recall  f1-score   support

           0       0.48      0.57      0.52        60
           1       0.76      0.75      0.76       126
           2       0.62      0.53      0.57        68

    accuracy                           0.65       254
   macro avg       0.62      0.62      0.62       254
weighted avg       0.66      0.65      0.65       254

In [31]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM'], 'Accuracy': model.score(X_test, y_test)},index={'3'})
tempResultsDf
Out[31]:
Model Accuracy
3 SVM 0.649606
In [32]:
result=[]
print("Accuracy with out applying PCA :",end=" ")
print(tempResultsDf["Accuracy"][0])
result.append(tempResultsDf["Accuracy"][0])
Accuracy with out applying PCA : 0.6496062992125984

Applying K fold cross validation

In [33]:
#Build the model with the best hyper parameters
model = SVC(C=0.5, kernel="linear")
scores = cross_val_score(model,X, y, cv=10)
print(scores)
print(np.mean(scores))
[0.91764706 0.91764706 0.97647059 0.91764706 0.95294118 0.91764706
 0.92857143 0.97619048 0.96428571 0.97619048]
0.9445238095238094
In [34]:
print("Cross validation score with out applying PCA :",end=" ")
print(np.mean(scores))
result.append(np.mean(scores))
Cross validation score with out applying PCA : 0.9445238095238094

6. Applying Dimensionality Reduction using PCA

Scaling

In [35]:
# Scaling the independent attributes using zscore
X_z=X.apply(zscore)
In [36]:
# prior to scaling
plt.rcParams['figure.figsize']=(10,6)
plt.plot(vehicle_df)
plt.show()
In [37]:
#plt.plot(X_z,figsize=(20,10))
plt.rcParams['figure.figsize']=(10,6)
plt.plot(X_z)
plt.show()
  • We can see that scaling has brought down axis and also it has brought every attrubutes on same scale.

Covariance

In [38]:
# Calculating the covariance between attributes after scaling
cov_matrix = np.cov(X_z.T)
print('Covariance Matrix \n%s', cov_matrix)
Covariance Matrix 
%s [[ 1.00118343  0.68569786  0.79086299  0.72277977  0.1930925   0.50051942
   0.81358214 -0.78968322  0.81465658  0.67694334  0.77078163  0.80712401
   0.58593517 -0.24697246  0.19754181  0.1565327   0.29889034  0.36598446]
 [ 0.68569786  1.00118343  0.79325751  0.63903532  0.20349327  0.5611334
   0.8489411  -0.82244387  0.84439802  0.96245572  0.80371846  0.82844154
   0.92691166  0.06882659  0.13651201 -0.00967793 -0.10455005  0.04640562]
 [ 0.79086299  0.79325751  1.00118343  0.79516215  0.24462154  0.66759792
   0.90614687 -0.9123854   0.89408198  0.77544391  0.87061349  0.88498924
   0.70660663 -0.22962442  0.09922417  0.26265581  0.14627113  0.33312625]
 [ 0.72277977  0.63903532  0.79516215  1.00118343  0.65132393  0.46450748
   0.77085211 -0.82636872  0.74502008  0.58015378  0.78711387  0.76115704
   0.55142559 -0.39092105  0.03579728  0.17981316  0.40632957  0.49234013]
 [ 0.1930925   0.20349327  0.24462154  0.65132393  1.00118343  0.15047265
   0.19442484 -0.29849719  0.16323988  0.14776643  0.20734569  0.19663295
   0.14876723 -0.32144977 -0.05609621 -0.02111342  0.401356    0.41622574]
 [ 0.50051942  0.5611334   0.66759792  0.46450748  0.15047265  1.00118343
   0.49133933 -0.50477756  0.48850876  0.64347365  0.40186618  0.46379685
   0.39786723 -0.33584133  0.08199536  0.14183116  0.08389276  0.41366325]
 [ 0.81358214  0.8489411   0.90614687  0.77085211  0.19442484  0.49133933
   1.00118343 -0.97275069  0.99092181  0.81004084  0.96201996  0.98160681
   0.80082111  0.01132718  0.06431825  0.21189733  0.00563439  0.1189581 ]
 [-0.78968322 -0.82244387 -0.9123854  -0.82636872 -0.29849719 -0.50477756
  -0.97275069  1.00118343 -0.95011894 -0.77677186 -0.94876596 -0.94997386
  -0.76722075  0.07848365 -0.04699819 -0.18385891 -0.11526213 -0.2171615 ]
 [ 0.81465658  0.84439802  0.89408198  0.74502008  0.16323988  0.48850876
   0.99092181 -0.95011894  1.00118343  0.81189327  0.94845027  0.97475823
   0.79763248  0.02757736  0.07321311  0.21405404 -0.01867064  0.09940372]
 [ 0.67694334  0.96245572  0.77544391  0.58015378  0.14776643  0.64347365
   0.81004084 -0.77677186  0.81189327  1.00118343  0.75110957  0.79056684
   0.86747579  0.05391989  0.13085669  0.00413356 -0.10407076  0.07686047]
 [ 0.77078163  0.80371846  0.87061349  0.78711387  0.20734569  0.40186618
   0.96201996 -0.94876596  0.94845027  0.75110957  1.00118343  0.94489677
   0.78600191  0.02585841  0.02472235  0.19735505  0.01518932  0.08643233]
 [ 0.80712401  0.82844154  0.88498924  0.76115704  0.19663295  0.46379685
   0.98160681 -0.94997386  0.97475823  0.79056684  0.94489677  1.00118343
   0.78389866  0.00939688  0.0658085   0.20518392  0.01757781  0.11978365]
 [ 0.58593517  0.92691166  0.70660663  0.55142559  0.14876723  0.39786723
   0.80082111 -0.76722075  0.79763248  0.86747579  0.78600191  0.78389866
   1.00118343  0.21553366  0.16316265 -0.05573322 -0.22471583 -0.11814142]
 [-0.24697246  0.06882659 -0.22962442 -0.39092105 -0.32144977 -0.33584133
   0.01132718  0.07848365  0.02757736  0.05391989  0.02585841  0.00939688
   0.21553366  1.00118343 -0.05782288 -0.12414277 -0.83372383 -0.90239877]
 [ 0.19754181  0.13651201  0.09922417  0.03579728 -0.05609621  0.08199536
   0.06431825 -0.04699819  0.07321311  0.13085669  0.02472235  0.0658085
   0.16316265 -0.05782288  1.00118343 -0.04178316  0.0867631   0.06269293]
 [ 0.1565327  -0.00967793  0.26265581  0.17981316 -0.02111342  0.14183116
   0.21189733 -0.18385891  0.21405404  0.00413356  0.19735505  0.20518392
  -0.05573322 -0.12414277 -0.04178316  1.00118343  0.07456104  0.20088894]
 [ 0.29889034 -0.10455005  0.14627113  0.40632957  0.401356    0.08389276
   0.00563439 -0.11526213 -0.01867064 -0.10407076  0.01518932  0.01757781
  -0.22471583 -0.83372383  0.0867631   0.07456104  1.00118343  0.89363767]
 [ 0.36598446  0.04640562  0.33312625  0.49234013  0.41622574  0.41366325
   0.1189581  -0.2171615   0.09940372  0.07686047  0.08643233  0.11978365
  -0.11814142 -0.90239877  0.06269293  0.20088894  0.89363767  1.00118343]]
  • Covariance tells about information contained in the mathematical space between the independent attributes.

Eigenvalues and Eigenvectors

In [39]:
#Finding eigenvalues amd eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s', eigenvectors)
print('\n Eigen Values \n%s', eigenvalues)
Eigen Vectors 
%s [[-2.72502890e-01 -8.70435783e-02  3.81852075e-02  1.38675013e-01
  -1.37101466e-01  2.63611383e-01  2.02717114e-01 -7.58796410e-01
   3.66685918e-01  1.60045219e-01  8.40252779e-02  2.14645175e-02
  -1.87350749e-02  6.89082276e-02  4.26105276e-02  9.97784975e-02
  -8.22590084e-02 -3.30366937e-02]
 [-2.87254690e-01  1.31621757e-01  2.01146908e-01 -3.80554832e-02
   1.38995553e-01 -7.13474241e-02 -3.92275358e-01 -6.76034223e-02
   5.53261885e-02 -1.82323962e-01 -3.65229874e-02  1.47247511e-01
  -4.89102355e-02  5.90534770e-02 -6.74107885e-01  1.63466948e-01
  -2.59100771e-01  2.48832011e-01]
 [-3.02421105e-01 -4.61430061e-02 -6.34621085e-02  1.08954287e-01
   8.00174278e-02 -1.69006151e-02  1.63371282e-01  2.77371950e-01
   7.46784853e-02  2.73033778e-01  4.68505530e-01  6.52730855e-01
   4.74162132e-03 -1.62108150e-01 -4.99754439e-04 -6.36582307e-02
   1.20629778e-01  9.80561531e-02]
 [-2.69713545e-01 -1.97931263e-01 -5.62851689e-02 -2.54355087e-01
  -1.33744367e-01 -1.38183653e-01  1.61910525e-01  1.10544748e-01
   2.66666666e-01 -5.05987218e-02 -5.45526034e-01  7.52188680e-02
   3.70499547e-03 -3.93288246e-01  1.74861248e-01 -1.33284415e-01
  -1.86241567e-01  3.60765151e-01]
 [-9.78607336e-02 -2.57839952e-01  6.19927464e-02 -6.12765722e-01
  -1.23601456e-01 -5.77828612e-01  9.27633094e-02 -1.86858758e-01
  -3.86296562e-02 -3.43037888e-02  2.65023238e-01 -2.40287269e-02
   8.90928349e-03  1.63771153e-01 -6.31976228e-02  2.14665592e-02
   1.24639367e-01 -1.77647590e-01]
 [-1.95200137e-01 -1.08045626e-01  1.48957820e-01  2.78678159e-01
   6.34893356e-01 -2.89096995e-01  3.98266293e-01 -4.62187969e-02
  -1.37163365e-01  1.77960797e-01 -1.92846020e-01 -2.29741488e-01
   4.09727876e-03  1.36576102e-01 -9.62482815e-02 -6.89934316e-02
   1.40804371e-01  9.99006987e-02]
 [-3.10523932e-01  7.52853487e-02 -1.09042833e-01  5.39294828e-03
  -8.55574543e-02  9.77471088e-02  9.23519412e-02  6.46204209e-02
  -1.31567659e-01 -1.43132644e-01  9.67172431e-02 -1.53118496e-01
   8.55513044e-01  6.48917601e-02 -4.36596954e-02 -1.56585696e-01
  -1.43109720e-01 -5.28457504e-02]
 [ 3.09006904e-01 -1.32299375e-02  9.08526930e-02  6.52148575e-02
   7.90734442e-02 -7.57282937e-02 -1.04070600e-01 -1.92342823e-01
   2.89633509e-01 -7.93831124e-02 -2.29926427e-02  2.33454000e-02
   2.61858734e-01 -4.96273257e-01 -3.08568675e-01 -2.44030327e-01
   5.11966770e-01 -9.49906147e-02]
 [-3.07287000e-01  8.75601978e-02 -1.06070496e-01  3.08991500e-02
  -8.16463820e-02  1.05403228e-01  9.31317767e-02  1.38684573e-02
  -8.95291026e-02 -2.39896699e-01  1.59356923e-01 -2.17636238e-01
  -4.22479708e-01 -1.13664100e-01 -1.63739102e-01 -6.71547392e-01
  -6.75916711e-02 -2.16727165e-01]
 [-2.78154157e-01  1.22154240e-01  2.13684693e-01  4.14674720e-02
   2.51112937e-01 -7.81962142e-02 -3.54564344e-01 -2.15163418e-01
  -1.58231983e-01 -3.82739482e-01 -1.42837015e-01  3.15261003e-01
   2.00493082e-02 -8.66067604e-03  5.08763287e-01 -5.00643538e-02
   1.60926059e-01 -2.00262071e-01]
 [-2.99765086e-01  7.72657535e-02 -1.44599805e-01 -6.40050869e-02
  -1.47471227e-01  1.32912405e-01  6.80546125e-02  1.95678724e-01
   4.27034669e-02  1.66090908e-01 -4.59667614e-01  1.18383161e-01
  -4.15194745e-02  1.35985919e-01 -2.52182911e-01  2.17416166e-01
   3.24139804e-01 -5.53139002e-01]
 [-3.05532374e-01  7.15030171e-02 -1.10343735e-01 -2.19687048e-03
  -1.10100984e-01  1.15398218e-01  9.01194270e-02  3.77948210e-02
  -1.51072666e-01 -2.87457686e-01  2.09345615e-01 -3.31340876e-01
  -1.22365190e-01 -2.42922436e-01  3.94502237e-02  4.48936624e-01
   4.62827872e-01  3.22499534e-01]
 [-2.63237620e-01  2.10582046e-01  2.02870191e-01 -8.55396458e-02
   5.21210685e-03 -6.70573978e-02 -4.55292717e-01  1.46752664e-01
   2.63771332e-01  5.49626527e-01  1.07713508e-01 -3.99260390e-01
   1.66056546e-02 -3.30876118e-02  2.03029913e-01 -1.06621517e-01
   8.55669069e-02  2.40609291e-02]
 [ 4.19359352e-02  5.03621577e-01 -7.38640211e-02 -1.15399624e-01
  -1.38068605e-01 -1.31513077e-01  8.58226790e-02 -3.30394999e-01
  -5.55267166e-01  3.62547303e-01 -1.26596148e-01  1.21942784e-01
   1.27186667e-03 -2.96030848e-01 -5.79407509e-02 -3.08034829e-02
  -5.10909842e-02  8.79644677e-02]
 [-3.60832115e-02 -1.57663214e-02  5.59173987e-01  4.73703309e-01
  -5.66552244e-01 -3.19176094e-01  1.24532179e-01  1.14255395e-01
  -5.99039250e-02 -5.79891873e-02 -3.25785780e-02  2.88590518e-03
  -4.24341185e-04  4.01635562e-03 -8.22261600e-03  2.05544442e-02
  -4.39201991e-03 -3.76172016e-02]
 [-5.87204797e-02 -9.27462386e-02 -6.70680496e-01  4.28426032e-01
  -1.30869817e-01 -4.68404967e-01 -3.02517700e-01 -1.15403870e-01
   5.23845772e-02  1.28995278e-02 -3.62255133e-02 -1.62495314e-02
  -9.40554994e-03  8.00562035e-02  1.12172401e-02 -2.31296836e-03
   1.13702813e-02  4.44850199e-02]
 [-3.80131449e-02 -5.01621218e-01  6.22407145e-02 -2.74095968e-02
  -1.80519293e-01  2.80136438e-01 -2.58250261e-01 -9.46599623e-02
  -3.79168935e-01  1.87848521e-01 -1.38657118e-01  8.24506703e-02
   2.60800892e-02  2.45816461e-01 -7.88567114e-02 -2.81093089e-01
   3.19960307e-01  3.19055407e-01]
 [-8.47399995e-02 -5.07612106e-01  4.17053530e-02  9.60374943e-02
   1.10788067e-01  5.94444089e-02 -1.73269228e-01 -6.49718344e-03
  -2.80340510e-01  1.33402674e-01  8.39926899e-02 -1.29951586e-01
  -4.18109835e-03 -5.18420304e-01 -3.18514877e-02  2.41164948e-01
  -3.10989286e-01 -3.65128378e-01]]

 Eigen Values 
%s [9.74940269e+00 3.35071912e+00 1.19238155e+00 1.13381916e+00
 8.83997312e-01 6.66265745e-01 3.18150910e-01 2.28179142e-01
 1.31018595e-01 7.98619108e-02 7.33979478e-02 6.46162669e-02
 5.16287320e-03 4.01448646e-02 1.98136761e-02 2.27005257e-02
 3.22758478e-02 2.93936408e-02]
  • Eigenvectors are the new dimensions of the new mathematical space.
  • Eigenvalues are the information content of each one of these eigenvectors or we can say it is the spread of these eigenvectors.
In [40]:
# Make a set of (eigenvalue, eigenvector) pairs
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))]
eigen_pairs.sort(reverse=True)
eigen_pairs[:]
Out[40]:
[(9.7494026893796,
  array([-0.27250289, -0.28725469, -0.30242111, -0.26971354, -0.09786073,
         -0.19520014, -0.31052393,  0.3090069 , -0.307287  , -0.27815416,
         -0.29976509, -0.30553237, -0.26323762,  0.04193594, -0.03608321,
         -0.05872048, -0.03801314, -0.08474   ])),
 (3.3507191194129797,
  array([-0.08704358,  0.13162176, -0.04614301, -0.19793126, -0.25783995,
         -0.10804563,  0.07528535, -0.01322994,  0.0875602 ,  0.12215424,
          0.07726575,  0.07150302,  0.21058205,  0.50362158, -0.01576632,
         -0.09274624, -0.50162122, -0.50761211])),
 (1.1923815452731603,
  array([ 0.03818521,  0.20114691, -0.06346211, -0.05628517,  0.06199275,
          0.14895782, -0.10904283,  0.09085269, -0.1060705 ,  0.21368469,
         -0.1445998 , -0.11034374,  0.20287019, -0.07386402,  0.55917399,
         -0.6706805 ,  0.06224071,  0.04170535])),
 (1.1338191632147865,
  array([ 0.13867501, -0.03805548,  0.10895429, -0.25435509, -0.61276572,
          0.27867816,  0.00539295,  0.06521486,  0.03089915,  0.04146747,
         -0.06400509, -0.00219687, -0.08553965, -0.11539962,  0.47370331,
          0.42842603, -0.0274096 ,  0.09603749])),
 (0.8839973120036124,
  array([-0.13710147,  0.13899555,  0.08001743, -0.13374437, -0.12360146,
          0.63489336, -0.08555745,  0.07907344, -0.08164638,  0.25111294,
         -0.14747123, -0.11010098,  0.00521211, -0.1380686 , -0.56655224,
         -0.13086982, -0.18051929,  0.11078807])),
 (0.6662657454310791,
  array([ 0.26361138, -0.07134742, -0.01690062, -0.13818365, -0.57782861,
         -0.289097  ,  0.09774711, -0.07572829,  0.10540323, -0.07819621,
          0.1329124 ,  0.11539822, -0.0670574 , -0.13151308, -0.31917609,
         -0.46840497,  0.28013644,  0.05944441])),
 (0.3181509095843841,
  array([ 0.20271711, -0.39227536,  0.16337128,  0.16191053,  0.09276331,
          0.39826629,  0.09235194, -0.1040706 ,  0.09313178, -0.35456434,
          0.06805461,  0.09011943, -0.45529272,  0.08582268,  0.12453218,
         -0.3025177 , -0.25825026, -0.17326923])),
 (0.2281791421155399,
  array([-0.75879641, -0.06760342,  0.27737195,  0.11054475, -0.18685876,
         -0.0462188 ,  0.06462042, -0.19234282,  0.01386846, -0.21516342,
          0.19567872,  0.03779482,  0.14675266, -0.330395  ,  0.1142554 ,
         -0.11540387, -0.09465996, -0.00649718])),
 (0.13101859512585415,
  array([ 0.36668592,  0.05532619,  0.07467849,  0.26666667, -0.03862966,
         -0.13716337, -0.13156766,  0.28963351, -0.0895291 , -0.15823198,
          0.04270347, -0.15107267,  0.26377133, -0.55526717, -0.05990393,
          0.05238458, -0.37916894, -0.28034051])),
 (0.0798619108203654,
  array([ 0.16004522, -0.18232396,  0.27303378, -0.05059872, -0.03430379,
          0.1779608 , -0.14313264, -0.07938311, -0.2398967 , -0.38273948,
          0.16609091, -0.28745769,  0.54962653,  0.3625473 , -0.05798919,
          0.01289953,  0.18784852,  0.13340267])),
 (0.07339794782509301,
  array([ 0.08402528, -0.03652299,  0.46850553, -0.54552603,  0.26502324,
         -0.19284602,  0.09671724, -0.02299264,  0.15935692, -0.14283702,
         -0.45966761,  0.20934562,  0.10771351, -0.12659615, -0.03257858,
         -0.03622551, -0.13865712,  0.08399269])),
 (0.06461626687535531,
  array([ 0.02146452,  0.14724751,  0.65273085,  0.07521887, -0.02402873,
         -0.22974149, -0.1531185 ,  0.0233454 , -0.21763624,  0.315261  ,
          0.11838316, -0.33134088, -0.39926039,  0.12194278,  0.00288591,
         -0.01624953,  0.08245067, -0.12995159])),
 (0.040144864577102306,
  array([ 0.06890823,  0.05905348, -0.16210815, -0.39328825,  0.16377115,
          0.1365761 ,  0.06489176, -0.49627326, -0.1136641 , -0.00866068,
          0.13598592, -0.24292244, -0.03308761, -0.29603085,  0.00401636,
          0.0800562 ,  0.24581646, -0.5184203 ])),
 (0.03227584776690066,
  array([-0.08225901, -0.25910077,  0.12062978, -0.18624157,  0.12463937,
          0.14080437, -0.14310972,  0.51196677, -0.06759167,  0.16092606,
          0.3241398 ,  0.46282787,  0.08556691, -0.05109098, -0.00439202,
          0.01137028,  0.31996031, -0.31098929])),
 (0.029393640750313605,
  array([-0.03303669,  0.24883201,  0.09805615,  0.36076515, -0.17764759,
          0.0999007 , -0.05284575, -0.09499061, -0.21672717, -0.20026207,
         -0.553139  ,  0.32249953,  0.02406093,  0.08796447, -0.0376172 ,
          0.04448502,  0.31905541, -0.36512838])),
 (0.022700525706220828,
  array([ 0.0997785 ,  0.16346695, -0.06365823, -0.13328441,  0.02146656,
         -0.06899343, -0.1565857 , -0.24403033, -0.67154739, -0.05006435,
          0.21741617,  0.44893662, -0.10662152, -0.03080348,  0.02055444,
         -0.00231297, -0.28109309,  0.24116495])),
 (0.019813676080862888,
  array([ 4.26105276e-02, -6.74107885e-01, -4.99754439e-04,  1.74861248e-01,
         -6.31976228e-02, -9.62482815e-02, -4.36596954e-02, -3.08568675e-01,
         -1.63739102e-01,  5.08763287e-01, -2.52182911e-01,  3.94502237e-02,
          2.03029913e-01, -5.79407509e-02, -8.22261600e-03,  1.12172401e-02,
         -7.88567114e-02, -3.18514877e-02])),
 (0.0051628732047465965,
  array([-1.87350749e-02, -4.89102355e-02,  4.74162132e-03,  3.70499547e-03,
          8.90928349e-03,  4.09727876e-03,  8.55513044e-01,  2.61858734e-01,
         -4.22479708e-01,  2.00493082e-02, -4.15194745e-02, -1.22365190e-01,
          1.66056546e-02,  1.27186667e-03, -4.24341185e-04, -9.40554994e-03,
          2.60800892e-02, -4.18109835e-03]))]
In [41]:
# print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigenvalues)
Eigenvalues in descending order: 
[9.74940269e+00 3.35071912e+00 1.19238155e+00 1.13381916e+00
 8.83997312e-01 6.66265745e-01 3.18150910e-01 2.28179142e-01
 1.31018595e-01 7.98619108e-02 7.33979478e-02 6.46162669e-02
 5.16287320e-03 4.01448646e-02 1.98136761e-02 2.27005257e-02
 3.22758478e-02 2.93936408e-02]

Finding variance and cummulative variance by each eigenvector

In [42]:
tot = sum(eigenvalues)
var_exp = [( i /tot ) * 100 for i in sorted(eigenvalues, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
plt.plot(var_exp)
Cumulative Variance Explained [ 54.0993254   72.69242795  79.30893968  85.60048941  90.50578051
  94.2028816   95.96829741  97.23446089  97.96148159  98.40463444
  98.81191882  99.17047375  99.39323715  99.57233547  99.73544045
  99.86140541  99.97135127 100.        ]
Out[42]:
[<matplotlib.lines.Line2D at 0x20779d71b50>]
  • We can observe that their is steep drop in variance explained with increase in number the number of Principal Components.
  • Also, top 10 principal components contribute 98.5% variance (information). So, we can proceed further with 10 components.
In [43]:
# Ploting 
plt.figure(figsize=(8 , 7))
plt.bar(range(1, eigenvalues.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eigenvalues.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
  • Here also we can visualize and eliminiate last 7 principal components because almost 98.5% of the data is captured by 10 principal components.
In [44]:
# Reducing from 17 to 10 dimension space
pca = PCA(n_components=10)
data_reduced = pca.fit_transform(X_z)
data_reduced.transpose()
Out[44]:
array([[ 0.58422804, -1.5121798 ,  3.91344816, ...,  5.12009307,
        -3.29709502, -4.96759448],
       [-0.67567325, -0.34893367,  0.2345073 , ..., -0.18227007,
        -1.10194286,  0.42274968],
       [-0.45333356, -0.33343619, -1.26509352, ..., -0.50836783,
         1.93384417,  1.30871531],
       ...,
       [-0.68196902,  0.10442512,  0.17305277, ..., -0.38820845,
         0.45880709, -0.21433678],
       [ 0.31266966, -0.29625823,  0.19108534, ..., -0.07735512,
         0.82142229,  0.59676772],
       [ 0.14411602, -0.39097765, -0.52948668, ...,  0.55527162,
        -0.34059305,  0.10856429]])
In [45]:
pca.components_
Out[45]:
array([[ 0.27250289,  0.28725469,  0.30242111,  0.26971354,  0.09786073,
         0.19520014,  0.31052393, -0.3090069 ,  0.307287  ,  0.27815416,
         0.29976509,  0.30553237,  0.26323762, -0.04193594,  0.03608321,
         0.05872048,  0.03801314,  0.08474   ],
       [-0.08704358,  0.13162176, -0.04614301, -0.19793126, -0.25783995,
        -0.10804563,  0.07528535, -0.01322994,  0.0875602 ,  0.12215424,
         0.07726575,  0.07150302,  0.21058205,  0.50362158, -0.01576632,
        -0.09274624, -0.50162122, -0.50761211],
       [-0.03818521, -0.20114691,  0.06346211,  0.05628517, -0.06199275,
        -0.14895782,  0.10904283, -0.09085269,  0.1060705 , -0.21368469,
         0.1445998 ,  0.11034374, -0.20287019,  0.07386402, -0.55917399,
         0.6706805 , -0.06224071, -0.04170535],
       [ 0.13867501, -0.03805548,  0.10895429, -0.25435509, -0.61276572,
         0.27867816,  0.00539295,  0.06521486,  0.03089915,  0.04146747,
        -0.06400509, -0.00219687, -0.08553965, -0.11539962,  0.47370331,
         0.42842603, -0.0274096 ,  0.09603749],
       [ 0.13710147, -0.13899555, -0.08001743,  0.13374437,  0.12360146,
        -0.63489336,  0.08555745, -0.07907344,  0.08164638, -0.25111294,
         0.14747123,  0.11010098, -0.00521211,  0.1380686 ,  0.56655224,
         0.13086982,  0.18051929, -0.11078807],
       [ 0.26361138, -0.07134742, -0.01690062, -0.13818365, -0.57782861,
        -0.289097  ,  0.09774711, -0.07572829,  0.10540323, -0.07819621,
         0.1329124 ,  0.11539822, -0.0670574 , -0.13151308, -0.31917609,
        -0.46840497,  0.28013644,  0.05944441],
       [ 0.20271711, -0.39227536,  0.16337128,  0.16191053,  0.09276331,
         0.39826629,  0.09235194, -0.1040706 ,  0.09313178, -0.35456434,
         0.06805461,  0.09011943, -0.45529272,  0.08582268,  0.12453218,
        -0.3025177 , -0.25825026, -0.17326923],
       [-0.75879641, -0.06760342,  0.27737195,  0.11054475, -0.18685876,
        -0.0462188 ,  0.06462042, -0.19234282,  0.01386846, -0.21516342,
         0.19567872,  0.03779482,  0.14675266, -0.330395  ,  0.1142554 ,
        -0.11540387, -0.09465996, -0.00649718],
       [ 0.36668592,  0.05532619,  0.07467849,  0.26666667, -0.03862966,
        -0.13716337, -0.13156766,  0.28963351, -0.0895291 , -0.15823198,
         0.04270347, -0.15107267,  0.26377133, -0.55526717, -0.05990393,
         0.05238458, -0.37916894, -0.28034051],
       [-0.16004522,  0.18232396, -0.27303378,  0.05059872,  0.03430379,
        -0.1779608 ,  0.14313264,  0.07938311,  0.2398967 ,  0.38273948,
        -0.16609091,  0.28745769, -0.54962653, -0.3625473 ,  0.05798919,
        -0.01289953, -0.18784852, -0.13340267]])
In [46]:
X_comp = pd.DataFrame(pca.components_,columns=list(X_z))
X_comp.head()
Out[46]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.272503 0.287255 0.302421 0.269714 0.097861 0.195200 0.310524 -0.309007 0.307287 0.278154 0.299765 0.305532 0.263238 -0.041936 0.036083 0.058720 0.038013 0.084740
1 -0.087044 0.131622 -0.046143 -0.197931 -0.257840 -0.108046 0.075285 -0.013230 0.087560 0.122154 0.077266 0.071503 0.210582 0.503622 -0.015766 -0.092746 -0.501621 -0.507612
2 -0.038185 -0.201147 0.063462 0.056285 -0.061993 -0.148958 0.109043 -0.090853 0.106070 -0.213685 0.144600 0.110344 -0.202870 0.073864 -0.559174 0.670680 -0.062241 -0.041705
3 0.138675 -0.038055 0.108954 -0.254355 -0.612766 0.278678 0.005393 0.065215 0.030899 0.041467 -0.064005 -0.002197 -0.085540 -0.115400 0.473703 0.428426 -0.027410 0.096037
4 0.137101 -0.138996 -0.080017 0.133744 0.123601 -0.634893 0.085557 -0.079073 0.081646 -0.251113 0.147471 0.110101 -0.005212 0.138069 0.566552 0.130870 0.180519 -0.110788
In [47]:
# P_reduce represents reduced mathematical space.
# Reducing from 17 to 10 dimension space
P_reduce = np.array(eigenvectors[0:10])   
# projecting original data into principal component dimensions
X_std_10D = np.dot(X_z,P_reduce.T)   
# converting array to dataframe for pairplot
Proj_data_df = pd.DataFrame(X_std_10D)  
In [48]:
#Let us check it visually
sns.pairplot(Proj_data_df, diag_kind='kde') 
Out[48]:
<seaborn.axisgrid.PairGrid at 0x207761109d0>
  • Now, there are almost no correlation between independent attributes but there are some attributes which shows some correlation. The reason behind this is that some attributes in data are less correlated but we still taken that for dimentionality reduction. The solution may be we can remove columns which are less correlated then apply PCA.

Splitting the data

In [49]:
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(Proj_data_df,y, test_size = 0.3, random_state = 10)

Model Building

Support Vector Classifier

In [50]:
model = SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
In [51]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.9510135135135135
Accuracy on Testing data:  0.937007874015748
Recall value:  0.934870649182451
Precision value:  0.9311268479985285
Confusion Matrix:
 [[ 66   2   1]
 [  3 118   3]
 [  2   5  54]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.96      0.94        69
           1       0.94      0.95      0.95       124
           2       0.93      0.89      0.91        61

    accuracy                           0.94       254
   macro avg       0.93      0.93      0.93       254
weighted avg       0.94      0.94      0.94       254

In [52]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM'], 'Accuracy': model.score(X_test, y_test)},index={'4'})
tempResultsDf
Out[52]:
Model Accuracy
4 SVM 0.937008
In [53]:
print("Accuracy with out applying PCA :",end=" ")
print(tempResultsDf["Accuracy"][0])
result.append(tempResultsDf["Accuracy"][0])
Accuracy with out applying PCA : 0.937007874015748

Using Grid Search to tune model parameters

In [54]:
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
model = SVC()
params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
model1 = GridSearchCV(model, param_grid=params, verbose=5)
model1.fit(X_train, y_train)
print("Best Hyper Parameters:\n", model1.best_params_)
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] C=0.01, kernel=linear ...........................................
[CV] ............... C=0.01, kernel=linear, score=0.765, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] ............... C=0.01, kernel=linear, score=0.697, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] ............... C=0.01, kernel=linear, score=0.746, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] ............... C=0.01, kernel=linear, score=0.754, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] ............... C=0.01, kernel=linear, score=0.712, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] .................. C=0.01, kernel=rbf, score=0.513, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] .................. C=0.01, kernel=rbf, score=0.513, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] .................. C=0.01, kernel=rbf, score=0.517, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] .................. C=0.01, kernel=rbf, score=0.517, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s
[CV] .................. C=0.01, kernel=rbf, score=0.508, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ................ C=0.1, kernel=linear, score=0.857, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ................ C=0.1, kernel=linear, score=0.798, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ................ C=0.1, kernel=linear, score=0.847, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ................ C=0.1, kernel=linear, score=0.941, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ................ C=0.1, kernel=linear, score=0.805, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ................... C=0.1, kernel=rbf, score=0.815, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ................... C=0.1, kernel=rbf, score=0.756, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ................... C=0.1, kernel=rbf, score=0.797, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ................... C=0.1, kernel=rbf, score=0.831, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ................... C=0.1, kernel=rbf, score=0.737, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ................ C=0.5, kernel=linear, score=0.866, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ................ C=0.5, kernel=linear, score=0.790, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ................ C=0.5, kernel=linear, score=0.839, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ................ C=0.5, kernel=linear, score=0.932, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ................ C=0.5, kernel=linear, score=0.780, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ................... C=0.5, kernel=rbf, score=0.882, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ................... C=0.5, kernel=rbf, score=0.891, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ................... C=0.5, kernel=rbf, score=0.924, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ................... C=0.5, kernel=rbf, score=0.949, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ................... C=0.5, kernel=rbf, score=0.898, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] .................. C=1, kernel=linear, score=0.874, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] .................. C=1, kernel=linear, score=0.798, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] .................. C=1, kernel=linear, score=0.847, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] .................. C=1, kernel=linear, score=0.932, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] .................. C=1, kernel=linear, score=0.797, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ..................... C=1, kernel=rbf, score=0.882, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ..................... C=1, kernel=rbf, score=0.916, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ..................... C=1, kernel=rbf, score=0.949, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ..................... C=1, kernel=rbf, score=0.975, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ..................... C=1, kernel=rbf, score=0.907, total=   0.0s
Best Hyper Parameters:
 {'C': 1, 'kernel': 'rbf'}
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:    0.5s finished

Using k fold cross validation in SVM

In [55]:
#Build the model with the best hyper parameters
model = SVC(C=0.5, kernel="linear")
scores = cross_val_score(model, Proj_data_df, y, cv=10)
print(scores)
print(np.mean(scores))
[0.81176471 0.8        0.85882353 0.85882353 0.87058824 0.84705882
 0.85714286 0.9047619  0.88095238 0.88095238]
0.8570868347338936
In [56]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM k fold'], 'Accuracy': np.mean(scores)},index={'5'})
tempResultsDf
Out[56]:
Model Accuracy
5 SVM k fold 0.857087
In [57]:
print("Cross validation score with PCA :",end=" ")
print(np.mean(scores))
result.append(np.mean(scores))
Cross validation score with PCA : 0.8570868347338936
In [58]:
result
Out[58]:
[0.6496062992125984, 0.9445238095238094, 0.937007874015748, 0.8570868347338936]

Result

In [59]:
print("Accuracy score without PCA :",result[0])
print("Cross validation score without PCA :",result[1])
print("Accuracy score with PCA :",result[2])
print("Cross validation score with PCA :",result[3])
Accuracy score without PCA : 0.6496062992125984
Cross validation score without PCA : 0.9445238095238094
Accuracy score with PCA : 0.937007874015748
Cross validation score with PCA : 0.8570868347338936

Accuracy is getting increased when PCA is implemented

Cross validation score is getting decreased when PCA is implemented

This may be bacuse PCA tunes the model to much better state to face the test data.